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1   Introduction* 

A  real-time  system  provides  a  set  of  services  subject  to  timing  con- 
straints. The  design  of  such  system  is  often  expensive  and  error  prone 
because  of  concurrency  and  timing  constraints.  Moreover,  many  highly  reliable 
applications  require  the  use  of  real-time  computer  systems;  such  as  fly-by- 
wire  passenger  aircraft,  nuclear  reactor  safety  systems,  etc.  While  several 
approaches  to  reliable  hardware  exist,  there  are  few  methodologies  that  permit 
the  construction  of  reliable  software.  Fault-tolerant  software  is  required  to 
prevent  system  failure  caused  by  accident,  sabotage,  or  residual 
hardware/software  design  faults  (that  is,  faults  not  detected  by  proof, 
inspection,  or  testing).  This  paper  discusses  fault-tolerance  in  high  level 
programming  languages  for  real-time  systems  and  reviews  the  motivation  and 
design  of  the  deadline  mechanism  [1].  An  application  of  the  mechanism  to  a 
satellite  on-board  computer  system  is  described  which  demonstrates  the  poten- 
tial of  the  deadline  mechanism  for  programming  fault-tolerant  real-time 
processes. 

The  definitions  of  concepts  involved  in  real-time  fault-tolerance  are 
based  on  the  paper  [1].  It  is  assumed  that  timing  constraints  are  imposed 
upon  a  system  by  its  specification.  Interactions  between  the  system  and  an 
environment  are  performed  by  service  components  of  the  system.  These  service 
components  receive  requests  from  the  environment,  perform  a  service,  and  then 
respond  to  the  environment. 

A  system  clock  is  assumed  to  be  an  essential  component  of  a  real-time 
system.  System  time  is  part  of  the  internal  state  of  the  real-time  system. 
It  is  assumed  that  consistent  views  of  the  times  in  external  environments   can 
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be  obtained  as  functions  of  the  system  clock. 

Associated  with  each  service  component  is  a  request  time  (the  value  of 
the  system  clock  when  a  request  is  detected)  and  a  response  time  (the  value  of 
the  system  clock  by  which  a  response  must  be  completed).  A  response  period  is 
the  maximum  allowable  amount  of  system  time  that  can  elapse  between  a  request 
time  for  a  particular  service  and  the  corresponding  response  time  for  that 
service.  The  response  period  is  part  of  the  specification  of  the  system.  A 
request  period  is  the  interarrival  time  of  requests  for  the  service  measured 
with  respect  to  the  system  clock  and  is  also  part  of  the  specification.  A 
deadline  is  the  system  time  by  which  a  system  must  respond  to  a  request  and  is 
the  sum  of  the  request  time  and  response  period  for  a  given  service  component. 
An  execution  period  is  the  maximum  amount  of  system  time  required  to  execute  a 
block  of  instructions  assuming  no  residual  faults  and  uninterruptability. 
This  is  a  measurable  quantity  and  is  determined  for  particular  blocks  of 
instructions  executing  on  specific  processors.  The  execution  period  of  a  ser- 
vice component  is  called  a  service  period. 


2   Problems  Associated  with  Real-Time  Systems. 


2 . 1   Reliability  Issues  in  Real-Time  Systems. 

In  real-time  systems,  software  is  a  vital  component  whose  failure  may 
result  in  catastrophe.  For  example,  in  defense  missile  systems  the  possibil- 
ity of  software  failure  can  make  the  whole  system  unacceptable.  In  an  aircraft 
control  system  a  failure  may  lose  human  lives,  and  in  nuclear  reactor  safety 
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systems  a  failure  may  result  in  a  disaster.  Hence  the  requirement  that  real- 
time systems  meet  their  specifications  is  crucial. 

Reliable  software  has  been  traditionally  obtained  through  fault- 
avoidance  techniques.  Minimizing  system  complexity,  top-down  modular  design, 
structured  programming,  and  proving  program  correctness  are  all  efforts  to 
achieve  this  goal.  However,  a  fault-tolerance  approach  [2]  is  required 
because  of  at  least  the  following  reasons: 

1)  It  is  always  impossible  to  guarantee  the  absence  of  faults, 

2)  The  delays  and  interruptions  of  real-time  systems  (air  traffic  con- 
trol, process  control,  etc. )  caused  by  manual  repair  actions  is 
unacceptable, 

3)  Manual  intervention  is  inaccessible  for  some  systems  (space,  under- 
sea, etc.),  and 

4)  High  cost  of  lost  time  and  of  manual  maintenance  is  unacceptable  in 
many  installations. 

One  critical  requirement  associated  with  real-time  system  reliability  is   that 
the  timing  constraints  are  met. 


2.2   Timing  Faults  in  Real-Time  Systems. 

A  real-time  system  that  can  always  respond  to  a  request  before  the 
deadline  is  characterized  as  timely.  A  timing  failure  occurs  when  a  real-time 
system  violates  one  of  the  timing  constraints  in  its  specification.  A  timing 
fault  causes  a  timing  error,  and  a  timing  error  is  defined  as  information 
representing  an  incorrect  timing  relationship  in  the  internal  state  of  the 
system.   Timing  faults  may  exist  because  of  the  following  situations: 

1)  miscalculation  of  the  execution  periods  of  some  algorithms, 
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2)  transient  hardware  faults, 

3)  unbounded  algorithms  due  to  loops  or  recursions, 

4)  system  overload, 

5)  resource  contention,   either  of   software   resources   or  hardware 
resources  ,  and 

6)  system  deadlock. 

Timing  faults  may  be  included  in  software  despite  control  over  the 
system  specification,  design,  and  implementation.  Timing  errors  may  also  occur 
because  of  hardware  or  some  external  system  failure  not  documented  in  the 
requirement.  In  order  to  cope  with  unexpected  timing  errors,  we  need  an  error 
recovery  mechanism. 


2.3   Problems  with  Assembly  Languages. 

Many  real-time  systems  are  implemented  in  assembly  language.  Some  of 
the  reasons  for  using  assembly  language  are  that  the  interrupts  can  be  con- 
trolled explicitly  with  assembly  language,  machine-dependent  features  are 
easier  to  program,  and  the  execution  time  is  easy  to  estimate  and  usually 
smaller  than  that  of  using  high  level  language.  However,  the  proof  of  correct- 
ness of  an  assembly  language  program  is  difficult  and  the  specification  of 
timing  constraints  in  assembly  language  programs  is  almost  impossible.  Simu- 
lation, repeated  testing,  and  program  redundancy  can  be  used  to  improve  system 
reliability.  However,  simulations  may  be  lengthy  and  it  is  frequently  diffi- 
cult to  simulate  the  system  closely.  Lack  of  time  may  prevent  repeated  testing 
from  attempting  all  the  possible  contingencies  and  input  values.  The  provi- 
sion of  redundancy  in  assembly  language  programs  is  ad  hoc.  Furthermore, 
these  methods  do  not  guarantee  the  timing  constraints  will  be  met.   Finally, 
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systems  developed  in  assembly  language  are  usually  very  expensive. 


2.4   Problems  with  High  Level  Languages. 

With  advancing  technology  in  high  speed  computer  hardware  and  increas- 
ing complexity  of  software  systems,  the  current  trend  is  to  develop  high  level 
programming  languages  to  support  real-time  systems  [3] .  HAL/S  [4]  developed 
by  Intermetric  Inc.  has  been  used  to  program  the  Space  Shuttle  control  system. 
Modula  [5],  Concurrent  Pascal  [6],  ILIAD  [7],  Ada  [8]  and  Path  Pascal  [9,  10] 
are  all  efforts  in  this  direction.  The  advantage  of  high  level  system  program- 
ming languages  is  obvious;  it  facilitates  programming,  understanding,  modifi- 
cation, and  most  importantly,  maintenance  of  the  system.  Other  advantages  are 
portability  and  reliability. 

Reliability  is  greatly  enhanced  by  the  use  of  high  level  languages  for 
real-time  systems  because  program  verification  is  simplified.  Timing  con- 
straints are  an  important  aspect  of  reliability  in  real-time  systems  and 
should  be  rigorously  specified.  Verification  that  these  constraints  are 
satisfied  is,  however,  very  difficult.  It  is  important  to  provide  a  sys- 
tematic approach  to  fault-tolerance  in  highly  reliable  real-time  systems. 
Most  existing  high  level  languages,  however,  do  not  specify  timing  con- 
straints. Furthermore,  there  is  no  generalized  methodology  in  existing  high 
level  languages  to  tolerate  timing  faults  or  to  recover  from  timing  errors. 
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3   Fault-Tolerance  in  High  Level  Languages . 


3. 1   Fault-Tolerance  in  Real-Time  Systems. 

3.1.1   Recovery  Block  Scheme. 

Fault-tolerant  design  techniques  have  been  used  for  a   long  time   to 

improve  hardware  reliability,   but   only  recently  have  they  been  proposed  to 

achieve  high  software  reliability  [11,  12,  13].   The   recovery  block  scheme 

[11,   14]  has  been  proposed  as  a  mechanism  to  support  fault-tolerant  software. 

A  simple  recovery  block  is  shown  below. 

ensure  acceptance_test 
by      primary 
else  by  alternate 
else  error; 

The  acceptance_test  is  an  assertion  about  the  results  computed  by  the  primary 
or  the  alternate.  If  the  acceptance  test  is  false  after  a  primary  execution, 
the  alternate  is  executed.  Errors  detected  implicitly  by  the  hardware  or 
software  during  the  execution  of  the  primary  or  the  alternate  automatically 
set  the  acceptance  test  to  false.  If  both  the  primary  and  alternate  of  a 
recovery  block  fail  to  satisfy  the  acceptance  test,  a  software  error  is  gen- 
erated. The  alternate  performs  the  same  function  as  the  primary  but  may  use  a 
different  algorithm.  A  hardware  device  called  the  recovery  cache  [15]  is  used 
to  restore  the  state  after  a  primary  failure  to  the  state  prior  to  entry  to 
the  recovery  block.  The  recovery  block  scheme  has  been  implemented  in  Sequen- 
tial Pascal  [16]  and  hardware  caches  have  been  constructed  for  the  PDP-11 
[17].  The  recovery  block  scheme  is  applied  by  Hecht  to  real-time  applications 
[18].  Hecht  augments  the  acceptance  test  with  a  watchdog  timer  that  monitors 
the  execution  within  a  specified  response  period.  Problems  arise,  however,  if 
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recovery  blocks  are  applied  to  software  systems  whose  services  must  meet  dead- 
lines. The  deadline  mechanism  [1] ,  inspired  by  the  recovery  block  scheme,  is 
proposed  as  a  means  to  systematically  construct  fault-tolerant  real-time  sys- 
tems with  guaranteed  response  time. 

3.1.2   Deadline  Mechanism. 

The  deadline  mechanism,  although  similar  to  the  recovery  block  scheme, 
performs  a  different  function.  Recovery  blocks  are  insensitive  to  the  passage 
of  time  and  cannot  be  used  to  ensure  that  a  response  is  generated  by  a 
specific  deadline.  (That  is,  it  is  impossible  to  recover  from  a  missed  dead- 
line by  resetting  the  system  clock  and  executing  an  alternate.)  Recovery 
blocks  do  provide  alternate  algorithms  to  cope  with  general  software  faults 
involving  logical  errors  but  can  not  handle  timing  errors  properly.  The  dead- 
line mechanism  includes  scheduling  mechanisms  to  ensure  timely  responses.  The 
recovery  block  and  deadline  mechanism  complement  each  other  and  may  be  nested 
to  provide  tolerance  to  both  timing  and  other  faults. 

The  deadline  mechanism  associates  two  algorithms  with  each  particular 
service  component.  The  "primary"  algorithm  produces  a  better  quality  service 
than  the  "alternate".  The  alternate  is  a  simpler,  deterministic  algorithm 
which  always  produces  an  acceptable  result  in  a  known,  fixed  length  of  time. 
The  deadline  mechanism  provides  fault-tolerance  in  a  real-time  system  by 
ensuring  that  either  the  primary  or  the  alternate  will  complete  before  the 
deadline.  A  set  of  simulations  demonstrating  the  feasibility  of  fault- 
tolerance  in  real-time  systems  is  described  in  [1]  . 

Reliable  scheduling  of  primaries  or  alternates  to  meet  real-time  con- 
straints requires  the  calculation  of  the  execution  period  of  each  algorithm. 
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Accurate  determination  of  execution  periods  is  critical  to  system  performance 
and  reliability.  Anderson  and  Witty  [19]  proposed  placing  upper  bounds  on 
iterative  and  recursive  program  codes.  For  many  desirable  algorithms  such 
upper  bounds  may  not  be  easily  established.  The  "service  period"  of  a  com- 
ponent can  then  be  set,  for  scheduling  purposes,  at  a  figure  that  is  less  than 
the  response  period,  but  no  less  than  the  execution  time  of  the  alternate. 
Since  upper  bounds  on  the  execution  periods  for  the  alternate  algorithms  are 
assumed,  the  deadline  mechanism  has  been  designed  so  that  the  scheduler 
reserves  a  time  for  execution  of  the  alternate  as  requests  arrive.  Primaries 
are  scheduled  in  any  remaining  time  which  is  called  slack  time. 


3.2   Experimental  Language  Construct  for  the  Deadline  Mechanism. 

The  deadline  mechanism  is  a  tool  for  specifying  timing  constraints 
explicitly  in  the  program  and  provides  tolerance  for  timing  faults.  We  have 
incorporated  an  experimental  language  construct  for  the  fault-tolerant  dead- 
line mechanism  into  an  existing  programming  language,  Path  Pascal  [9,  10] . 
Path  Pascal  was  chosen  as  the  vehicle  for  this  experiment  because  of  its  avai- 
lability and  high  level  synchronization  mechanisms.  Path  Pascal  includes  con- 
current processes,  device  programming  features,  a  real-time  clock,  an  encapsu- 
lation mechanism  called  an  "object",  and  Open  Path  Expressions  [20]  for  syn- 
chronization of  processes.  Each  object  consists  of  a  declaration  of  the  data 
structure  together  with  a  set  of  operations  which  access  the  values  of  that 
structure.  Objects  permit  synchronization  to  be  associated  with  shared  data 
and  are  implemented  as  an  extension  of  the  Pascal  structure  type  facility. 
Operations  accessible  outside  of  the  object  are  specified  by  prefixing  the 
declaration  of  an  operation  with  the  reserved  word  "entry".   "Interrupt 
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processes"  permit  device  programming  by  means  of  a  "doio"  statement  similar  to 
that  in  Modula  [5] .  The  real-time  programming  features  include  three  built-in 
routines  "await",  "delay",  and  "time"*  Await  suspends  a  process  until  the 
argument  of  await  corresponds  to  the  system  time.  Delay  suspends  a  process 
for  the  number  of  time  units  specified  in  its  argument.  Time  is  a  function 
which  returns  the  current  value  of  the  system  clock. 

Path  Pascal  has  been  extended  to  include  the  fault-tolerant  deadline 
mechanism.  The  extensions  should  not  be  considered  as  a  language  design, 
rather  as  experimental  features  which  have  allowed  the  programming  of  example 
reliable  real-time  systems.  A  "deadline  process"  provides  a  fault-tolerant 
deadline  service.  The  response  period  to  a  request  is  a  parameter  of  the 
deadline  process.  The  deadline  process  contains  one  "request"  and  one  "ser- 
vice" statement.  The  request  period  is  specified  following  an  "every"  phrase 
in  the  "request"  statement.  The  service  statement  specifies  the  primary  and 
alternate  algorithms.  A  simple  example  of  a  deadline  process  is  shown  below: 

deadline  process  position__update  [response  =  onesecond]  ; 
begin 

every  fiveseconds  request 
get_data_f rom_sensors; 
service  nextposition. compute 
else  nextposition. approximate; 
until  navigation_terminated; 
end; 

where  "nextposition"  is  an  object  containing  the  information  of  positions  to 
be  updated.  "Compute"  and  "approximate"  are  two  entry  procedures  in  the  object 
to  do  the  position  update.  "Compute"  is  a  primary  algorithm  and  "approximate" 
an  alternate  algorithm. 

In  the  example  above,  deadline  process  "position_update"  computes  the 
"nextposition"  every  five  seconds  starting  from  the  instantiation  of  the  dead- 
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line  process.  The  request  statement  reads  data  from  sensors  and  the  service 
statement  calculates  the  next  position.  Assuming  that  the  execution  period  of 
reading  sensor  data  and  the  approximating  algorithm  is  less  than  the  response 
period,  the  deadline  mechanism  ensures  each  request  will  complete  before  its 
deadline. 

The  rate-monotonic  priority  assignment  [21]  is  used  to  schedule  alter- 
nates using  the  response  period  as  a  static  priority,  with  small  values  having 
high  priority.  At  any  given  instant  the  alternate  with  the  highest  priority 
(smallest  response  period)  executes.  This  scheme  requires  preemption  of 
alternates.  It  has  been  shown  that  such  a  system  will  be  timely  if  processor 
utilization  is  less  than  In  2  (approx.  0.69)  [21].  One  advantage  of  this 
method  is  a  reduction  in  scheduling  complexity  and  overhead,  an  aspect  not 
considered  in  [1].  The  primaries  are  scheduled  in  earliest-deadline-first 
order. 

Executions  of  the  primary  and  alternate  within  the  service  period  are 
ordered  alternate  first  (the  first-chance  scheduling  algorithm  [1]).  If  the 
primary  completes  within  its  execution  period,  its  results  are  used  in  prefer- 
ence to  those  of  the  alternate.  If  the  primary  should  fail  to  complete  within 
its  execution  period,  the  results  from  the  alternate  are  used.  A  modified 
recovery  cache  has  been  implemented  in  a  software  interpreter  for  the  language 
and  is  used  to  hold  results  from  the  alternate  until  the  primary  either  fails 
or  completes  successfully. 
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4   Application  to  a_  Satellite  On-board  System. 

A  satellite  on-board  system  has  been  selected  as  an  example  of  a 
real-time  system.  The  on-board  system  requires  high  reliable  hardware  and 
software  to  achieve  a  successful  mission.  Software  failures,  including  timing 
failures,  can  degrade  the  system  performance  and  inhibit  the  success  of  the 
mission.  In  the  worst  case,  it  can  cause  the  loss  of  the  satellite  costing 
tens  of  millions  of  dollars.  We  show  how  the  fault-tolerant  deadline  mechanism 
can  be  applied  to  the  on-board  system  to  tolerate  timing  faults. 


4 . 1   The  Model  of  the  On-board  System. 

The  model  of  the  on-board  system  is  based  on  Multimission  Modular 
Spacecraft  [22]  which  is  the  standard  satellite  for  a  wide  variety  of  earth 
orbiting  missions  in  the  1980's.  The  spacecraft  is  a  zero-momentum,  three 
axis  stabilized  satellite.  The  executive  program  for  its  on-board  system  has 
been  written  in  a  low  level  language  [23] . 

A  block  diagram  of  the  model  is  depicted  in  Fig.  1. 

The  Central  Unit  (CU)  interfaces  with  the  On-Board  Computer  (OBC), 
Remote  Interface  Units  [RIUs],  Command  Demodulator-Decoder  and  the  PCM  Modula- 
tor. From  the  CU,  commands  are  channeled  through  the  Multiplex  Data  Bus  (MDB) 
to  the  RIUs  and  from  there  to  the  various  subsystems.  Telemetry  is  sampled  in 
the  subsystems,  selected  by  the  RIUs,  and  then  passed  to  the  format  generator 
in  the  CU  through  the  MDB.  From  the  CU,  PCM  goes  to  the  Modulator.  Both 
commands  and  PCM  are  channeled  to  the  OBC.  The  OBC  can  communicate  with  all 
satellite  subsystems  through  the  CU,  MDB  and  RIUs. 
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Fig.  1   Satellite  On-Board  System  Model 

Though  the  functions  of  the  OBC  generally  depend  on  the  mission,   the 

attitude   control,   delay  command  process,  statistical  calculation  and  limit 

checking,  power  management,  thermal  control,  and  mission  equipment  management 
are  functions  commonly  required  by  various  missions. 


4.2   Time  Critical  Subsystems. 

Among  the  functions  mentioned  above,  attitude  control  is  the  most  com- 
plicated and  the  most  time  critical,  especially  during  the  initial  acquisition 
phase  when  a  big  attitude  maneuver  is  required.  If  the  satellite  is  in  an 
incorrect  attitude  for  a  long  time  hardware  damage  may  result  from  abnormal 
thermal  conditions.  Furthermore,  the  solar  panel  can  not  provide  power  and 
the  battery  will  be  drained.  The  satellite  can  not  survive  without  battery 
power.  The  satellite  has  very  little  momentum  to  stabilize  its  attitude  and 
is  very  sensitive  to  disturbance  torque.   The  control  loop  must  continuously 
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monitor  disturbance  torque  and  provide  a  cancelling  control  torque  when  neces- 
sary. The  big  attitude  maneuver  requires  a  large  control  torque,  and  control 
loop  failures  during  this  period  may  cause  the  satellite  to  tumble.  If  the 
satellite  tumbles,  the  radio  link  between  the  satellite  and  the  ground  sta- 
tions may  be  lost.  (The  satellite  antenna  beam  must  be  aligned  with  the  ground 
stations.)  Should  this  happen,  the  ground  stations  will  be  unable  to  transmit 
suitable  recovery  commands  to  the  satellite,  the  satellite  will  become  out  of 
control  and  the  mission  will  fail. 

The  attitude  control  subsystem  consists  of  attitude  sensors  (gyros, 
sun  sensor,  and  earth  sensor),  OBC,  and  actuators  (inertial  wheels,  and  gas 
jets).  Gyro  signals  are  read  every  125  ms  and  pre-processed  every  half 
second.  Sun  and  earth  sensor  signals  are  read  every  second  and  sent  to  the 
gyro  calibration  process.  The  gyro  calibration  process  is  executed  every  two 
seconds  to  provide  the  calibration  data  to  the  attitude  determination  process. 
The  attitude  determination  process  is  the  most  essential  and  time  consuming 
process  within  this  subsystem.  This  procedure  is  executed  every  second. 
Actuator  control  signals  are  provided  by  the  control  process  every  second  and 
sent  to  the  inertial  wheels  every  125  ms. 

The  data  flow  between  the  processes  is  shown  in  the  following  timing 
chart  (Fig.  2). 

4. 3   Application  of  the  Deadline  Mechanism  to  the  Attitude  Control  System. 

The  deadline  mechanism  has  been  applied  to  each  process  of  the  atti- 
tude control  system.  For  example,  in  the  attitude  determination  process,  a 
linearized  Kalman  filter  is  used  as  the  primary  procedure.   The  filter  is 
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Fig.  2  Attitude  Control  Timing  Chart 

designed  to  determine  attitude  angles  of  the  satellite  and  their  changing 
rate.  To  achieve  required  accuracy,  the  primary  becomes  rather  complicated. 
The  alternate  is  a  simple  prediction  algorithm  by  a  transition  matrix.  This 
prediction  algorithm  is  accurate  enough  when  the  time  duration  of  the  predic- 
tion is  no  longer  than  4  or  5  seconds. 

A  program  example  for  attitude  determination  is   listed  below.   The 
example  shows  the  structure  of  the  program. 


deadline  process  attitude_determine  [response  =  onesecond] ; 
local  declarations 

procedure  filter; 
begin 

kalman  filter; 
end; 
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procedure  prediction; 
begin 

prediction  by  transition  matrix; 
end; 

begin  (*  attitude_determine  *) 

initializations  of  local  data; 

awalt(specif ied  time) ; 

every  onesecond  request 

service  filter  else  prediction; 

until  terminated; 
end;  (*  attitude_determine  *) 


All  the  other  functions  mentioned  in  the  previous  section  are  simi- 
larly programmed  as  deadline  processes.  The  synchronization  among  deadline 
processes  is  achieved  by  the  "await"  function. 


4.4   Reliability  Improvement  by  the  Deadline  Mechanism. 

The  deadline  mechanism  permits  the  construction  of  highly  reliable 
systems  from  primaries  that  are  orders  of  magnitude  less  reliable  than  the 
requirements.  The  following  argument  is  based  on  several  simplifying  assump- 
tions, however,  we  believe  that  it  demonstrates  the  potential  of  the  mechanism 
for  programming  reliable  systems.   Let  the  timing  failure  probability  of   the 

alternate  and  primary  for  a  unit  of  time  be  P[f  ]  and  P[f  ],  respectively.   We 

a        p 

assume  that  each  f  and  f  are  independent  and  that  the  deadline  mechanism 
implementation  is  reliable.  A  timing  failure  of  an  alternate  causes  a  subsys- 
tem failure.  Since  the  alternate  is  much  simpler  than  the  primary,  the  results 
calculated  by  the  alternate  includes  some  tolerable  inaccuracy.  Consecutive 
primary  failures,  although  tolerated  by  timely  execution  of  the  alternate, 
accumulate  the  inaccuracy  and  produce  unacceptable  results.  Therefore,  m.  con- 
secutive primary  timing  failures  are  considered  to  cause  a  subsystem  failure, 
"m"   is   determined  by   the  quality  of  the  alternate  algorithm  and  the  system 
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requirements.  Then  the  probability  of  the  subsystem  timing  failure  for  a  unit 
of  time,  denoted  by  P[f  ],  is  as  follows: 

P[f  ]  «  P[f  ]  +  P[s]*P[mf  ]  (1) 

S  3  P 

Where  P[mf  ]  is  the  probability  of  m  consecutive  primary  failures  for  a  unit 
of  time  and  P[s  ]  is  the  success  probability  of  the  alternate  for  a  unit  of 
time. 

Since  P[mf  ]  <  {P[f  ] }m,  we  can  rewrite  (1)  as: 
P       P 

P[f  ]  <  P[f  ]  <  P[£ J  +  P[s  ]*{P[f  ]>m (2) 

a       s       a       a      p 

The  satellite  acquisition  maneuver  is  often  designed  to  tolerate 
99.73%  of  errors  occurring  in  each  of  the  maneuver  parameters,  such  as  control 
torque,  disturbance  torque,  etc.,  during  the  maneuver.  (The  value  99.73% 
corresponds  to  an  occurrence  probability  within  3  standard  deviations  on 
either  side  of  the  nominal  under  the  assumption  of  normally  distributed 
errors.)  Considering  the  timing  constraint  of  a  deadline  process  as  a  maneuver 
parameter,  the  probability  of  a  timing  failure  of  a  deadline  process  must  be 
less  than  0.0027  for  the  duration  of  the  maneuver.  We  further  assume  that  the 
duration  of  the  maneuver  is  two  days.   For  the  attitude  determination  process, 

the   required  timing  failure  probability  for  the  request  period  (1  second)  is 

—8 
less  than  1.56*10   .   Here  we  choose  the  request  period  of  the  attitude  deter- 
mination process  as  the  unit  of  time  and  the  following  discussions  on  proba- 
bility are  based  on  this  unit  of  time. 

Without  the  benefit  of  the  deadline  mechanism,  the  timing  failure  pro- 

—8 
bability  of   the  primary,  P[f  ],  must  be  less  than  1.56*10   .  To  estimate  the 

effectiveness  of  the  deadline  mechanism,  we  will  make  the  following  simplifi- 
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cations.  We  can  simplify  equation  (2)  by  replacing  P[s  ]  by  1  (we  assume  that 
P[f  ]  is  very  small.)  Suppose  that  only  two  consecutive  failures  of   the  pri- 

8 

mary  can  be  allowed.   Using  the  required  timing  failure  probability  of  the 

subsystem,  we  can  estimate  the  required  timing  failure  probabilities   of   the 

primaries  and  alternates  that  will  guarantee  the  reliability  of  the  subsystem 
from: 

P[f  ]  <  P[f  ]  <  P[f  ]  +  <P[f  ]>2  <  1.56*10~8 
a       s       a       p 

The  alternate  is  much  simpler  than  the  primary  and  we  assume  that  it   can  be 

constructed  with  a  timing  failure  probability  less  than  that  of  the  subsystem 

—8 
(in  the  order  of  10   ).   The  required  timing  failure  probability  of   the  pri- 
mary can  be  four  order  of  magnitude  greater  than  that  of  the  alternate  or  the 
subsystem. 


5   Conclusions. 

Path  Pascal  has  been  extended  to  include  the  fault-tolerant  deadline 
mechanism.  The  satellite  on-board  system  has  been  simulated  using  the 
extended  Path  Pascal.  The  deadline  mechanism  provides  a  systematic  recovery 
scheme  for  timing  errors.  The  language  has  shown  its  potential  to  describe 
real-time  systems. 

The  required  manpower  to  develop  the  simulation  program  in  extended 
Path  Pascal  was  about  six  man-months.  The  development  of  the  executive  pro- 
gram by  the  0A0  project  using  low  level  language  required  forty-eight  man- 
months   [23] .    Though  the  simulation  program  is  a  simplified  version  of  the 
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executive,  we  believe  that  a  tremendous  coding  and  testing  time  can  be  saved 
by  the  high  level  programming  language  approach. 

Fault-tolerance  in  software  is  needed  in  addition  to  the  fault- 
avoidance  approach.  Indeed,  tolerance  and  avoidance  are  complementary 
approaches  to  the  fault  problems.  We  believe  that  theoretically  and  practi- 
cally the  reliability  of  a  real-time  system  is  considerably  improved  by  apply- 
ing the  deadline  mechanism.  Even  though  overhead  is  incurred  for  error 
recovery  and  executing  additional  algorithms  [24] ,  the  mechanism  achieves  high 
reliability  for  time  critical  components  of  real-time  systems. 
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